Red Wine Quality by Iulia Banescu

Introduction

The purpose of this project is to analyze the Red Wine dataset, in order to understand the relationship between various parameters that can impact wine’s quality.

Exploratory data analysis (EDA) techniques will be used to explore relationships in one variable to multiple variables and to explore the selected red wine dataset for distribution, outliers, and anomalies.

Dataset Overview

The dataset comprises of 1599 observations representing different red wine varieties and 13 variables.

## [1] 1599   13

The firts variable, X, represents the number id of the wine varienty, the other variables are different wine properties, while the last variable represents the quality scores - based on sensory data.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Attributes descriptions:

  1. Fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)

  2. Volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

  3. Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  4. Residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

  5. Chlorides: the amount of salt in the wine

  6. Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. Total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. Density: the density of water is close to that of water depending on the percent alcohol and sugar content

  9. pH: describes how acidic or basic a wine is on a scale from 0(very acidic) to 14(very basic); most wines are between 3-4 on the pH scale

  10. Sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant

  11. Alcohol: the percent alcohol content of the wine

  12. Quality: score between 0 and 10 (output variable - based on sensory data)

More information about the dataset can be found here.

Looking at the varibles type it can be seen that they are either numerical or integer. The variables related to the wine’s chemical propesties are continuous, while quality is an ordinal variable. To ease the analysis a new variable will be created where the quality will be converted into factor. Aditionally, another factor variable will be created by categorizing the quality variable into three rating groups:
- bad, if the quality score is below 5;
- average, if the quality score is 5 or 6;
- good, if the quality was 7 or above.

Questions

Through this exploratory analysis the following questions will be answered:
1. What is the most common wine quality?
2. Which wine characteristics have a higher impact on quality?
3. What relationships can be observed between wine caracteristics?

Univariate Plots Section

Before visualizing the data using graphs, a summary of the data will be displayed below.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality          rating    
##  Min.   :3.000   bad    :  63  
##  1st Qu.:5.000   average:1319  
##  Median :6.000   good   : 217  
##  Mean   :5.636                 
##  3rd Qu.:6.000                 
##  Max.   :8.000

The summary above provides a good overview on each variable. An interesting obsarvation is that, even though the scores for quality could range between 0 and 10, they are actually between 3 and 8 with a mean of 5.636. Looking at the rating variable, we can see that the majority of the tested wines have an average quality.

Another interesting observation is that for the total sulfur dioxide variable the minimum value is 6 and the maximum value is 289. With a mean of 46.47 and a median of 38, the maximum value could be an outlier. A similar situation can be observed for other variables like residual sugar and free sulfur dioxide.

Frequency Distribution

After seeing the data summary, the frequency distribution of variables will be displayed using histrograms.

The first variable to be investigated is the quality variable, since is the main focus of the analysis.

The histogram above shows that the most common wine quality is 5 with almost 700 wine varieties having this quality. The second most common quality is quality 6, that closely follows the previous with around 600 wine varieties. The third most common quality is 7 with around 200 wine varieties. As shown previously in the dataset summary, the quality scores range between 3 and 8, even though the possible options were between 0 and 10.

The next variables that will be displayed are the ones that have a normal or close to normal distribution.

Looking at the histrograms above it can be seen that density and pH have a clear normal distribution. While fixed acidity and volatile acidity both have a close to normal distribution.

The rest of the variables do not have a normal distribution. Using this function we can see what the impact of outliers on the dataset is.

## Outliers identified: 155 nPropotion (%) of outliers: 10.7 nMean of the outliers: 5.88 nMean without removing outliers: 2.54 nMean if we remove outliers: 2.18 n

For the residual sugar variable 155 (11%) outliers have been identified. In the initial initial graph we can see that the distribution is right skewed. After the elimination of the outliers the distribution became close to normal.

## Outliers identified: 112 nPropotion (%) of outliers: 7.5 nMean of the outliers: 0.2 nMean without removing outliers: 0.09 nMean if we remove outliers: 0.08 n

For the chlorhide variable 112 (7.5%) outliers have been identified. If initially the distribution was right skewed, without outliers the distribution became normal.

## Outliers identified: 30 nPropotion (%) of outliers: 1.9 nMean of the outliers: 51.9 nMean without removing outliers: 15.87 nMean if we remove outliers: 15.19 n

For the free sulfur dioxide varibale only 30 (2%) outliers have been identified. Since the number is so small, the impact on the distribution is also very small. As it can be seen in the graphs above, the distribution is right skewed with and without outliers.

## Outliers identified: 55 nPropotion (%) of outliers: 3.6 nMean of the outliers: 143.89 nMean without removing outliers: 46.47 nMean if we remove outliers: 43 n

The situation is similar for total sulfur dioxide variable. The variable has only 55 (4%) outliers, therefore, as can be seen in the graphs, the impact on distribution is small.

For the citric acid variable it is hard to tell the distribution type from the histogram.

## Outliers identified: 1 nPropotion (%) of outliers: 0.1 nMean of the outliers: 1 nMean without removing outliers: 0.27 nMean if we remove outliers: 0.27 n

The citric acid variable only had one outlier, therefore even after taking it out the distribution didn’t change.

Univariate Analysis

By visualizing each variable individually, several observations regarding the dataset cand be made.

One of the most important is regarding the quality variable, which is also the focus of this analysis. The first question in this analysis is about the most common quality that wine have in the dataset. The histogram shows that the most common wine quality in the dataset is 5, closely followed by quality 6. The histogram below shows the overview using the rating variable, where the quality scores were categorized in three groups.

What is the structure of your dataset?

Initialy the structure of the dataset was comprised of 1599 observations and 13 variables. The quality variable was converted into a factor and stored in a new variable. Adittionaly, the quality variable was categorized into three rating groups (bad, average and good) and the results were stored in a new variable called rating. Therefore the dataset size increased to 15 variables.

What is/are the main feature(s) of interest in your dataset?

The main variable of interest in the dataset is quality. The main objective of the analysis is to see what wine characteristics influnce quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The variables that are believed to have an impact on red wine’s quality are acidity (fixed and volatile), pH, residual sugar and maybe alcohol.

Did you create any new variables from existing variables in the dataset?

As was previously mentioned, two new variables were created because the quality variable was converted into a factor and categorized into three rating groups to ease the analysis of the dataset.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

From all variables, citric acid had an unsual distribution. Fixed and volatile acidity, density and pH were the variables with a normal distribution. The rest of the variables had a right skewed distribution due to outliers. Removing the identified outliers showed a significant impact on distribution, therefore it has been decided not to remove the outliers at this point.

Bivariate Plots Section

After individualy analyzing the variables, the next step would be to analyze the relationship between variables.

Before starting analyzing the relationship between various pairs of variables, a correlation matrix will be displayed in order to identify possible correlations.

## Warning in ggcorr(red_wine[, 2:14], geom = "blank", label = TRUE, hjust =
## 0.95, : data in column(s) 'rating' are not numeric and were ignored

The correlation method used in the analysis in Pearson. It is desirable to have a correlation coefficient above +0.5 or below -0.5. For the interpretation of the correlation coefficient the commonly used rule of thumb was applied. The correlation coefficients below +0.3 and above -0.3 are not highlighted in the matrix above, since the correlation is considered negligiable. The correlation coefficients above +0.3 are colored in red while the ones below -0.3 are colored in blue.

The correlation matrix shows that quality has the strongest correlation with alcohol. The correlation coefficient is 0.48 and is a low positive correlation. Quality also has a low negative correlation with volatile acidity.

The correlation matrix shows correlation also between variuos wine characteristics. The relationships between quality and the variables with which it has a correlation coefficient above abs 0.3 will be further investigated. Additionally the wine characteristics that have a correlation coefficient of 0.5 (absolute value) or above will be further inverstigated.

Impact of alcohol on quality

The positive correlation coefficient between quality and alcohol implies that the higher the alcohol concentration the higher the wine quality. However, no causality can be implied.

The positive correlation can be observed in the graphs above, especialy for the wines with a quality score of 6 or above.

Impact of volatile acidity on quality

Another highlited relationship in the coefficient matrix was the one between quality and volatile acidity.

Th plots above highlight the negative correlation between quality and volatile acidity. The lower the volatile acidity, the higher the wine quality.

Relationships between wine characteristics

In the correlation matrix we saw that fixed acidity had a strong relationship with citric acid, density and pH.

The scatter plots above highlight the positive correlation between fixed acidity and citric acid as well as between fixed acidity and density. Therefore, as the citric acid and density go up, the fixed acidity goes up as well. When we look at the relationship between fixed acidity and pH, we see that as the value of pH goes up the value of fixed acidity goes down. This means that there is a negative correlation between the two.

Next, the relationship betwen citric acid and volatiel acidity as well as the relationship between citric acid and density will be furter studied. In the univariate analysis it has been observed that citric acid variable had an outlier. Therefore, the plots below will show the relationships with and without the outlier.

To further analyze and vizualize the relation between citric acid and volatile acididy as well as with density, scatter plots were created. The scatter plots above show the relationship between citric acid and volatile acidity. The plot on the right shows the relationship without the citric acid outlier, while in the one of the left the outlier is included. However, both plots display a moderate negative correlation between the two variables, meaning that as the volatile acidity goes up the citric acid goes down.

The same approach was taken when analyzing the relationship between citric acid and density. Here both plots show a moderate positive correlation, meaning that as the density goes up the citric acid goes up as well.

In the correlation matrix free sulfur dioxide and total sulfur dioxide had a correlation coeffiecient above 0.5.

It has been decided not remove outliers since they had a big impact on the distribution of the variable. Therefore the analysis will be performed using two scenarius, with and without the outliers to see the impact of outliers.

## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 76 rows containing missing values (geom_point).

It can be seen that with and without outliers, there is a positive correlation between total sulfur dioxide and free sulfur dioxide. When one increases the other one increases as well.

In the correlation matrix, density and alcohol showed that their relationship should be further explored.

The scatter plot above help visualize the relationship highlited by the correlation coefficient matrix. It can be seen that between density and alcohol is a moderate negative correlation. Therefore, as the alcohol concentration increases the density decreases.

Bivariate Analysis

The bivariate plots enabled a better overview of the data, especially on the relationship between quality and the wine characteristics as well as among the characteristics.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Quality has a low positive correlation with alcohol, meaning that wines with higher alcohol concentration are expected to have a higher quality. Aditionally, quality apears to have a low negative correlation with volatile acidity, meaning that wines with low volatile acidity are expected to have a higher quality.

An interesting observation is that residual sugar and pH seem to have no relationship with quality. Also, the correlation between fixed acidity and quality is very low. These results are unexpected since the initial asumption was that these wine characteristics will have a certain influence on quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Another interesting finding was that fixed acidity had a strong correlation with three other wine characteristics: citric acid, density and pH.

What was the strongest relationship you found?

The strongest relationship was between fixed acidity and pH, with a correlation coeffiecient of -0.68.

Multivariate Plots Section

The previous analysis enabled the identification of relationships between various wine characteristics. In this section, these relationships will be further analyzed by adding new variables in the relationships.

Quality showed the highest correlation with alcohol, low positive towards moderate correlation. But at the same time, alcohol showed a moderate negative correlation with density. Therfore, it would be interesting to see what is the relationship between alcohol and density at various levels of quality.

Since the relationships between alcohol and quality as well as alcohol and density weren’t very strong from the begining, also the impact on quality is not very easy to observe. However, it seems that the scatter plot reinforces previous findings. It can be seen that most of the wines with an alcohol concentration bellow 10, have a density between 0.995 and 1 and the predominant quality is 5. As the alcohol concentration increases the plot shifts to the left, where we see that density is below 0.995 and the predominant wine quality is 6 and 7.

Quality showed a low negative relationship with volatile acidity, which had a moderate negative relationship with citric acid.

The low negative correlation between volatile acidity and quality can be seen here as well. However, citric acid seems to have no impact on quality.

An interesting finding in the bivariate analysis was the relationship of fixed acidity with other three wine caracteristics: density, pH and citric acid. It would interesting to see how these relationship look like when the quality variable is added to the mix.

Fixed acidity seams to have relationships with variables that based on the results of the correlation matrix have no impact on quality. These relationships are also supported by the plots above. If the relationship between fixed acidity and citric acid, pH and density is clear, no clear relationship with quality can be seen.

Multivariate Analysis

Multivariate plots reinforced the findings in the bivariate plots. Therefore, it can be said that high quality wines are expected to have a high alcohol concentartion. While wines with high volitile acidity are expected to have low quality.

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The multivariate plots were in line with the ones from the bivariate analysis.

Were there any interesting or surprising interactions between features?

An interesting observation was the fact the relationships between the wine characteristics had no impact on quality. ——

Final Plots and Summary

The purpose of this project was to analyze the red wine dataset, in order to understand the relationship between various parameters that can impact wine’s quality.

In order to perform this exploratory data analysis three main questions were asked:
1. What is the most common wine quality?
2. Which wine characteristics have a higher impact on quality?
3. What relationships can be observed between wine caracteristics?

In order to answer these questions, univariate, bivariate and multivariate analyses were performed.

The univariate analysis enabled a good overview of the frequency distribution for each variable. We were able to see that some variables like density, pH, volatile and fixed acidity have a normal distribution, while the other variables (except quality) have a right sweked distribution.

Plot One: Quality Distribution

Description One

The histogram shows that the most common wine quality was 5, with almost 700 wine varieties having this quality score. Furthermore, the quality scores were grouped into three main categories: bad, average and good. Based on this classification, the vast majority of wine varieties were in the average category.

Using this plot we were able to answer the first question of the analysis, namely: What is the most common wine quality?

Plot Two

The bivariate analysis enabled us to see relationships between variables in the dataset. Therefore, we were able to identify the wine characteristics that have the highest impact on wine quality.

Description Two

The boxplot on the left shows the relationship between alcohol and wine quality. The higher the alcohol concentration the higher the quality. The boxplot on the right shows the relationship between volatile acidity and alcohol. The higher the volatile acidity the lower the quality.

Alcohol and volatile acidity are the two wine characteristics that have the highest impact on alcohol. This is also the answer to the second question in the analysis, “Which wine characteristics have a higher impact on quality?”. This relationship can also be seen in the correlation matrix below.

Plot Three

Description Three

The correlation matrix above shows the relationships between the variables in the dataset. All the correlation coefficients with an absolute value above 0.3 are highlighted.

Aside from showing the wine characteristics that influence the quality, the matrix also shows the relationship between other various characteristics. It is interesting to see that fixed acidity has significant correlation with three other variables, namely: pH, density and citric acid. Another moderate to strong correlation can be seen between free sulfur dioxide and total sulfur dioxide.

By highlighting all these relationships between variables, the matrix enabled us to answer the third question of the analysis “What relationships can be observed between wine caracteristics?. ——

Reflection

The univariate analysis enabled me to have a good understanding of each variable. Moreover, I was able to see what was the distribution of quality scores in the dataset. It was surprising to see that the quality scores ranged from 3 to 8, eventhough the scores could have had a value between 0 and 10.

My initial assumption was that acidity (fixed and volatile), pH, residual sugar and alcohol would have the highest impact on alcohol. However, I was surprised to see that the correlation between wine characteristics and wine quality was very low, none of the correlation coefficients having and absolute value of 0.5 or above. Only volatile acidity and alcohol have an correlation coefficient above 0.3 (absolute value). Another interesting finding was that fixed acidity has a moderate correlation with other three characteristics: citric acid, density and pH.

The biggest challenge I faced during this analysis was choosing the right amount of graphs and information to be shared. For future analysis I think it would be good to also include wine varieties that have quality scores below 3 and above 8. Since the impact of the wine characteristics in this dataset seem to have low impact on quality (absolute value of the correlation coefficient below 0.5) it might be interesting to also include others characteristics in the future, like smell or flavour.